Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip XML element text #182

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Strip XML element text #182

wants to merge 1 commit into from

Conversation

cbenz
Copy link

@cbenz cbenz commented Jun 12, 2024

Hi,

I downloaded BIS SDMX files using their bulk download website (https://data.bis.org/static/bulk/{dataset_id}_sdmx-compact-2.1.zip, for example https://data.bis.org/static/bulk/WS_CBPOL_sdmx-compact-2.1.zip ).

Because those XML files are non-indented (they only have 1 line), I re-indented them using xmlindent, and now read_sdmx does not work anymore, because of the \n inserted in element.text:


--- SS without structure ---
1 (140191632947744) False

--- <class 'sdmx.message.StructureMessage'> ---
2 (140189840089040) <sdmx.StructureMessage>
  <Header>
    source: 
    test: False

--- ID ---
3 (140189838191248) IDREFc71e5f8d-5182-411e-8910-40010bd38ab1
    

--- Test ---
4 (140191565313648) false
    


Ignore:
 {140191632913024}
<mes:Prepared xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:str="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common">2024-06-10T14:35:22Z
    </mes:Prepared>
    

2024-06-11 01:06:47,774 CRITICAL root: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sdmx/reader/xml/common.py", line 204, in read_message
    result = func(self, element)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sdmx/reader/xml/v21.py", line 362, in _datetime
    reader.push(elem, isoparse(text))
                      ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/dateutil/parser/isoparser.py", line 37, in func
    return f(self, str_in, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/dateutil/parser/isoparser.py", line 138, in isoparse
    components += self._parse_isotime(dt_str[pos + 1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/dateutil/parser/isoparser.py", line 346, in _parse_isotime
    components[-1] = self._parse_tzstr(timestr[pos:])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/dateutil/parser/isoparser.py", line 395, in _parse_tzstr
    raise ValueError('Time zone offset requires sign')
ValueError: Time zone offset requires sign

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/download.py", line 8, in <module>
    main()
  File "/app/src/bis_fetcher/download.py", line 12, in main
    Downloader(downloader_helper=downloader_helper).start()
  File "/app/src/bis_fetcher/downloader.py", line 92, in start
    for dataset_id, dataflow_definition in self._source_data_loader.load_dataflow().items():
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/src/bis_fetcher/source_data_loader.py", line 73, in load_dataflow
    dataflow_msg = self._parse_sdmx_structure_message(self.dataflow_file)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/src/bis_fetcher/source_data_loader.py", line 144, in _parse_sdmx_structure_message
    msg = sdmx.read_sdmx(input_file)  # type:ignore[no-untyped-call]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
    return reader().read_message(obj, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sdmx/reader/xml/__init__.py", line 45, in read_message
    import_module(f"sdmx.reader.xml.{version}")
  File "/usr/local/lib/python3.12/site-packages/sdmx/reader/xml/common.py", line 221, in read_message
    raise XMLParseError from exc
sdmx.exceptions.XMLParseError: ValueError: Time zone offset requires sign

To add tests, I have seen the sdmx-test-data repository, but did not want to modify those files directly. How would you proceed about tests?

PR checklist

  • Checks all ✅
  • Update documentation
  • Update doc/whatsnew.rst

Copy link

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.87%. Comparing base (07d1e74) to head (47c87cf).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
- Coverage   98.88%   97.87%   -1.01%     
==========================================
  Files          95       95              
  Lines        7538     7538              
==========================================
- Hits         7454     7378      -76     
- Misses         84      160      +76     
Files Coverage Δ
sdmx/reader/xml/v21.py 99.15% <100.00%> (-0.15%) ⬇️
sdmx/reader/xml/v30.py 95.19% <100.00%> (ø)

... and 20 files with indirect coverage changes

@khaeru
Copy link
Owner

khaeru commented Jun 12, 2024

Thanks for this contribution and the explanation of the issue it's intended to address.

For the specimens:

  • You can indeed open a separate PR on khaeru/sdmx-test-data that adds files. That would be very welcome. See Add specimen for khaeru/sdmx#161 sdmx-test-data#1 for an example (the only one thus far!).
  • If you not able/confident to do so, you could also give the files as attachments to another comment on this GitHub PR (docs).
  • In either case, please provide both (a) one of the raw files and (b) one of the files you have processed with xmlindent. If the files are large, please delete some other elements besides the ones that trigger the error.

With those specimens, I will try to validate the indented file against the official XSD schemas. In particular, I will check whether the standards/schemas actually allow that an XML element with leading or trailing whitespace is valid:

...
<URN>
  urn:sdmx:etc.etc=etc(1.0)
</URN>
...

If it is valid, then I think your changes in this PR are the right fix and definitely should be accepted.
If it is not valid, then I think the code should additionally warn that invalid SDMX-ML is being handled; and the docs should warn that other tools (for instance xmlindent) can output this technically invalid SDMX-ML. We could extend your PR to add such warnings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants